This report is for ETC5521 Assignment 1 by Team emu, comprising Justin Thomas and Mayunk Bharadwaj.

1 Introduction and Motivation

Our primary question, which we address using the data provided on the ‘tidytuesday’ platform, is: what are the characteristics of a winning beach volleyball team for both males and females?

We believe that the characteristics of winning and losing teams might differ because of, for example, the prevalence of beach volleyball in certain countries. We also theorise that taller and younger players may be better at beach volleyball because of the competitive advantage they may have over shorter and more seasoned players.

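Therefore, the secondary questions that will help us answer our primary question are:

  • Which countries have the most winning players?
  • What is the average age of players on a winning team?
  • What is the average height of players on a winning team?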

The rest of this report describes the data, its source and its limitations; explains how the data was cleaned; presents an analysis that answers the above questions; and ends with a conclusion.

1.1 Limitations of Data Analysis

While going through the data set, we found that it was incomplete: there were multiple ‘NA’ values for individual player performance statistics. Observations that featured ‘NA’ values were therefore removed, as they were unlikely to be helpful in our analysis.

2 Data description

2.1 Questions

Primary Question

What are the characteristics of a winning beach volleyball team for both males and females?

Secondary Questions

  • Which countries have the most winning players?
  • What is the average age of players on a winning team?
  • What is the average height of players on a winning team?

2.2 Explanation of data being used

This data set provides beach volleyball statistics for men’s and women’s matches at two major tournaments, the Fédération Internationale de Volleyball (FIVB) Beach Volleyball World Championships and the Association of Volleyball Professionals (AVP) tour. Matches are played between teams of 2. The data set records tournament information, player information, player performance statistics and match results. It covers September 2000 to August 2019 and was compiled from the records kept at the tournaments.

The original data source, created by Adam Vagner, initially covered September 2000 to July 2017. It has been periodically updated since, most recently in May 2020. It can be found at https://github.com/BigTimeStats/beach-volleyball.

The structure of the data set is:

  • Rows: 76756
  • Columns: 65
  • Data types: Character, Numeric, Date and Difftime

There are 65 variables in this data set:

Variable Name
circuit
tournament
country
year
date
gender
match_num
w_player1
w_p1_birthdate
w_p1_age
w_p1_hgt
w_p1_country
w_player2
w_p2_birthdate
w_p2_age
w_p2_hgt
w_p2_country
w_rank
l_player1
l_p1_birthdate
l_p1_age
l_p1_hgt
l_p1_country
l_player2
l_p2_birthdate
l_p2_age
l_p2_hgt
l_p2_country
l_rank
score
duration
bracket
round
w_p1_tot_attacks
w_p1_tot_kills
w_p1_tot_errors
w_p1_tot_hitpct
w_p1_tot_aces
w_p1_tot_serve_errors
w_p1_tot_blocks
w_p1_tot_digs
w_p2_tot_attacks
w_p2_tot_kills
w_p2_tot_errors
w_p2_tot_hitpct
w_p2_tot_aces
w_p2_tot_serve_errors
w_p2_tot_blocks
w_p2_tot_digs
l_p1_tot_attacks
l_p1_tot_kills
l_p1_tot_errors
l_p1_tot_hitpct
l_p1_tot_aces
l_p1_tot_serve_errors
l_p1_tot_blocks
l_p1_tot_digs
l_p2_tot_attacks
l_p2_tot_kills
l_p2_tot_errors
l_p2_tot_hitpct
l_p2_tot_aces
l_p2_tot_serve_errors
l_p2_tot_blocks
l_p2_tot_digs

2.3 Data Cleaning

Our data was already in tidy format, so we did not have much cleaning to do. However, in order to conduct our analysis, we tidied the data set further by removing variables that are not pertinent to answering our questions.

The method we used to tidy our data is as follows:

  • We deselected some variables from appearing in the data set and overwrote the original data set with the new tidied data set.

We did not include variables such as match duration or individual player performance statistics because they do not help answer the questions we have laid out. Additionally, the majority of the data for these variables was unknown, so they would not have been useful in our analysis.
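The deselection step can be sketched as follows. This is a minimal illustration rather than the code used for the report: the toy `vb_matches` tibble and its values are made up, and the set of dropped columns is inferred by comparing the 65-variable list in Section 2.2 with the tidied list in Section 2.4.

```r
library(dplyr)

# Toy stand-in for the match data (made-up values); only a handful of the
# 65 columns are included.
vb_matches <- tibble(
  circuit          = c("AVP", "FIVB"),
  tournament       = c("Tournament A", "Tournament B"),
  w_player1        = c("Player A", "Player C"),
  w_p1_age         = c(29.4, 27.9),
  duration         = c("00:35:00", NA),
  w_p1_tot_attacks = c(20, NA),
  l_p1_tot_digs    = c(7, NA)
)

# Columns assumed to be dropped, inferred from Sections 2.2 and 2.4.
not_needed <- c("tournament", "match_num", "w_rank", "l_rank",
                "duration", "bracket", "round")

# Deselect the unneeded columns and every player performance statistic
# (all of which contain "_tot_"), then overwrite the original object.
# any_of() ignores names that are absent from the toy data.
vb_matches <- vb_matches %>%
  select(-any_of(not_needed), -contains("_tot_"))

names(vb_matches)
```

Using `any_of()` keeps the sketch from failing when a listed column is missing, which is convenient for a cut-down example like this one.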

2.4 Description of variables in data set as organised in tidy form

Variable        Description
circuit         Either AVP (USA) or FIVB (International)
country         Country where tournament played
year            Year of tournament
date            Date of match
gender          Gender of team
w_player1       Winner player 1 name
w_p1_birthdate  Winner player 1 birth date
w_p1_age        Winner player 1 age
w_p1_hgt        Winner player 1 height in inches
w_p1_country    Winner player 1 country
w_player2       Winner player 2 name
w_p2_birthdate  Winner player 2 birth date
w_p2_age        Winner player 2 age
w_p2_hgt        Winner player 2 height in inches
w_p2_country    Winner player 2 country
l_player1       Losing player 1 name
l_p1_birthdate  Losing player 1 birth date
l_p1_age        Losing player 1 age
l_p1_hgt        Losing player 1 height in inches
l_p1_country    Losing player 1 country
l_player2       Losing player 2 name
l_p2_birthdate  Losing player 2 birth date
l_p2_age        Losing player 2 age
l_p2_hgt        Losing player 2 height in inches
l_p2_country    Losing player 2 country
score           Match score, with points separated by a dash and sets separated by a comma, e.g. 21 points to 12 points is 21-12

2.5 Data sources

The original data is sourced from: Vagner, A. (2020, July 20). BigTimeStats/beach-volleyball. Retrieved August 22, 2020, from https://github.com/BigTimeStats/beach-volleyball

To load the data set, we used the “Tidy Tuesday” GitHub repository, which hosts a copy of it: Mock, J. (2020, May 19). rfordatascience/tidytuesday. Retrieved August 22, 2020, from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-05-19/readme.md

3 Analysis and findings

3.1 Which countries have the most winning players?

For both the AVP and FIVB tournaments, a team consists of 2 players. The two players may come from the same country or from different countries. Thus, in this section, our analysis focuses on finding the countries that had the largest number of winning teams, which in turn tells us which countries had the most winning players.

In order to find our answer to this question, we first did some data wrangling to get the data set up for analysis. Then we followed the steps outlined below:

  • First, we found the distinct combinations of player 1 and player 2, together with their countries (w_p1_country and w_p2_country). This ensures that the same team, with the same combination of player 1 and player 2, does not appear in multiple rows.
  • After getting the distinct teams, we grouped them by their players’ respective countries.
  • Following this, we used the tally() function to count the total number of teams per country and dropped any rows that had no values in them.
  • This gave us a list of all the participating countries with a total count of winning teams per country.
  • We rearranged the data to show the total count in descending order and renamed the variables to make them more meaningful.
  • Lastly, we saved this as a new data set called “country” to be used later on.
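The steps above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the report's actual code: the toy `winners` tibble and its values are made up, and the renamed column headings are taken from Table 3.1.

```r
library(dplyr)

# Toy winners data (made-up values); the real data has one row per match.
winners <- tibble(
  w_player1    = c("A", "A", "C", "E", "G"),
  w_player2    = c("B", "B", "D", "F", "H"),
  w_p1_country = c("United States", "United States", "United States", "Brazil", NA),
  w_p2_country = c("United States", "United States", "United States", "Brazil", "Germany")
)

country <- winners %>%
  # One row per distinct team, so repeat wins by the same pair count once.
  distinct(w_player1, w_player2, w_p1_country, w_p2_country) %>%
  # Count teams for each combination of player countries.
  group_by(w_p1_country, w_p2_country) %>%
  tally() %>%
  ungroup() %>%
  # Drop combinations where a country is missing.
  filter(!is.na(w_p1_country), !is.na(w_p2_country)) %>%
  # Largest counts first, then more readable column names.
  arrange(desc(n)) %>%
  rename(
    `Player 1 country` = w_p1_country,
    `Player 2 country` = w_p2_country,
    `Number of teams`  = n
  )

country
```

In the toy data the pair A and B wins twice but is counted as a single team, which is the point of the `distinct()` step.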

Figure 3.1: Top 20 countries with the most winning teams

Figure 3.1 shows the top 20 countries with the largest number of winning teams. We can see that the United States was the most dominant country, with a total of 4200 winning teams. At 2 players per team, this corresponds to 8400 winning players from the United States. In distant second place, Brazil had 258 winning teams, and so 516 Brazilian players won matches. In a close third place, Germany had 200 winning teams comprising 400 players. The remaining countries ranged from 166 winning teams down to 45 winning teams.

The clear winner here is the United States, and we can conclude that the majority of the winning players in the AVP and FIVB tournaments hail from the United States.

We decided to dig further into the United States. Although there were 4200 teams in which both players came from the United States, there were instances where 1 player came from the United States and the other player came from a different country. The following section takes a look at the different countries that partnered with the United States.

In order to find the different countries that partnered with the United States, we followed the steps outlined below:

  • First, we filtered the rows from the “country” data set so that values from “Player 1 country” variable are “United States” or values from “Player 2 country” variable are “United States”.

This gave us a list of all the country combinations in which either player 1 or player 2 came from the United States, along with the country of the other player.
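A minimal sketch of this filter is shown below. It assumes the “country” data set has the column names used in Table 3.1; the four rows are a small excerpt built from counts quoted in this section, and the object name `us_partners` is hypothetical.

```r
library(dplyr)

# Small excerpt of the "country" data set, using counts quoted in this section.
country <- tibble(
  `Player 1 country` = c("United States", "Brazil", "United States", "Poland"),
  `Player 2 country` = c("United States", "Brazil", "Brazil", "United States"),
  `Number of teams`  = c(4200, 258, 44, 34)
)

# Keep every combination in which at least one player is from the United States.
us_partners <- country %>%
  filter(`Player 1 country` == "United States" |
           `Player 2 country` == "United States")

us_partners
nrow(us_partners)  # number of country combinations involving the United States
```

Because the condition uses `|`, the all-American pairing is retained alongside the mixed pairings, while the Brazil and Brazil row is dropped.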

Table 3.1: Different combinations of countries that partnered with the United States
Player 1 country  Player 2 country  Number of teams
United States     United States     4200
United States     Brazil            44
Poland            United States     34
Canada            United States     25
Brazil            United States     24
United States     Canada            23
United States     Poland            20
Virgin Islands    United States     19
United States     Australia         18
United States     England           16
United States     Puerto Rico       15
United States     Virgin Islands    14
Puerto Rico       United States     13
Greece            United States     12
United States     Israel            11
Italy             United States     10
United States     France            10
Philippines       United States     9
Costa Rica        United States     8
England           United States     8

Table 3.1 shows 20 different country combinations, which is only a subset of the different countries that partnered with the United States. In total there were 66 different combinations.

Apart from both players coming from the United States, 44 different teams had player 1 come from the United States and player 2 come from Brazil. 34 teams had player 1 come from Poland and player 2 come from the United States.

The rest of the table shows just how prevalent the United States is as a competing country in volleyball tournaments. It appears not only in teams where both players come from the United States, but also in teams where only 1 player comes from the United States and partners with a player from a different country.

3.2 What is the average age of a winning team and how does it compare with a losing team?

N.B. For the method used to complete this analysis, please refer to the commentary included within the code chunks.

The average ages of male winning players 1 and 2 are 29.40 and 29.32 respectively. The average ages of male losing players 1 and 2, on the other hand, are 29.08 and 28.95 respectively. Age shows no obvious association with winning or losing, as the average ages of winners and losers are about the same. The data might, however, tell us something about the typical age of participation in professional male volleyball. If we plot the ages of, for instance, male winning player 1 (Figure 3.2) and male losing player 2 (Figure 3.3), we see that the most commonly occurring ages are in the late 20s (28-29 years of age). Given the high levels of participation at those ages, it is reasonable to infer that male volleyball players hit their peak in their late 20s.

Now, let’s consider women’s volleyball. The average ages of female winning players 1 and 2 are 27.98 and 28.29 respectively. The average ages of female losing players 1 and 2 are 27.52 and 27.73 respectively. As was the case with the male game, age does not seem to strongly influence winning. However, it is interesting to note that there is a slight difference in the average age of players between the genders. Figure 3.4 shows that the average age of winning player 2 is lower for females than for males. Similarly, Figure 3.5 shows that the average age of losing player 1 is also lower for females than for males.
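Averages of this kind can be computed with a grouped summary. The sketch below is not the report's own code chunk: the toy `matches` tibble and its values are made up, and the "M"/"W" gender coding is an assumption about how the data set labels men's and women's matches.

```r
library(dplyr)

# Toy match data (made-up ages); gender coding "M"/"W" is an assumption.
matches <- tibble(
  gender   = c("M", "M", "W", "W"),
  w_p1_age = c(30, 28, 27, NA),
  l_p1_age = c(29, 27, 26, 28)
)

# Mean age per gender for every age column, ignoring missing values.
avg_age <- matches %>%
  group_by(gender) %>%
  summarise(across(ends_with("_age"), ~ mean(.x, na.rm = TRUE)))

avg_age
```

Setting `na.rm = TRUE` matters here: without it, a single missing age would turn the whole group mean into `NA`.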

Figure 3.2: Ages of Male Winning Player 1

Figure 3.3: Ages of Male Losing Player 2

Figure 3.4: Ages of Winning Player 2 by gender

Figure 3.5: Ages of Losing Player 1 by gender

3.3 What is the average height of a winning team and how does it compare with a losing team?

N.B. For the method used to complete this analysis, please refer to the commentary included within the code chunks.

The average heights of female winning players 1 and 2 are 70.91 and 70.85 inches respectively. The average heights of female losing players 1 and 2 are 70.62 and 70.72 inches respectively. Although the losing players are shorter on average than the winning players, the difference is small (under 0.3 inches in both cases).

The average heights of male winning players 1 and 2 are 76.28 and 76.39 inches respectively, compared with 75.98 and 76.15 inches for losing players 1 and 2. Consider Figures 3.6 and 3.7, which display the differences in height between male winning and losing players 1 (Fig. 3.6) and male winning and losing players 2 (Fig. 3.7). In both cases, the differences in height are fairly evenly centred around 0, so we probably cannot say that height difference affects winning a volleyball game. We can, however, say that male volleyball participants are generally taller than female volleyball participants, although common sense tells us that this is not unique to volleyball.
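The height differences plotted in Figures 3.6 and 3.7 can be derived along the following lines. This is a hedged sketch with made-up heights, not the report's code; the object names `men` and `hgt_diff` and the column `p1_hgt_diff` are hypothetical.

```r
library(dplyr)

# Toy male matches (made-up heights in inches).
men <- tibble(
  w_p1_hgt = c(77, 75, 76, NA),
  l_p1_hgt = c(76, 76, 75, 74)
)

# Winner minus loser height for player 1; rows with a missing height are dropped.
hgt_diff <- men %>%
  mutate(p1_hgt_diff = w_p1_hgt - l_p1_hgt) %>%
  filter(!is.na(p1_hgt_diff))

# A mean close to 0 suggests height difference is not associated with winning.
mean(hgt_diff$p1_hgt_diff)
```

A positive difference means the winning player 1 was taller than the losing player 1 in that match; a distribution centred on 0 means neither side is systematically taller.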

Figure 3.6: Difference in Heights of Male Player 1

Figure 3.7: Difference in Heights of Male Player 2

4 Conclusion

After our analysis, we have concluded that a typical winning male volleyball team probably has both players originating from the United States. Player 1 has an average age of 29.40 and an average height of 76.28 inches, and player 2 has an average age of 29.32 and an average height of 76.39 inches.

Similarly, a typical winning female volleyball team probably has both players originating from the United States. Player 1 has an average age of 27.98 and an average height of 70.91 inches, and player 2 has an average age of 28.29 and an average height of 70.85 inches.

5 References

Mock, J. (2020, May 19). rfordatascience/tidytuesday. Retrieved August 22, 2020, from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-05-19/readme.md

Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC, Florida.

Vagner, A. (2020, July 20). BigTimeStats/beach-volleyball. Retrieved August 22, 2020, from https://github.com/BigTimeStats/beach-volleyball

Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Zhu, H. (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra